Intro

Text classification is one of the most common NLP tasks. Given enough training data, it's relatively straightforward to build a model that automatically classifies previously unseen texts according to the logic of the training data. In this post, I'll walk through the steps for building such a model. Specifically, I'll leverage the recently released spaCy v3.0 to train two classification models: one that identifies the sentiment of Chinese customer reviews as positive or negative (binary classification), and one that predicts their product category out of a list of five (multiclass classification). If you can't wait to see how spaCy v3.0 has made the training process an absolute breeze, feel free to jump to the section on training the textcat component with the CLI. If not, bear with me on this long journey. All the datasets and models created in this post are hosted in this repo of mine.

Preparing the dataset

Getting the dataset

I wanted to build classification models that take traditional Chinese texts as input, but I couldn't find any publicly available datasets of customer reviews in traditional Chinese, so I had to make do with reviews in simplified Chinese. Let's first download the dataset using !wget.

!wget https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/online_shopping_10_cats/online_shopping_10_cats.zip
--2021-03-07 14:08:42--  https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/online_shopping_10_cats/online_shopping_10_cats.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4084428 (3.9M) [application/zip]
Saving to: ‘online_shopping_10_cats.zip’

online_shopping_10_ 100%[===================>]   3.89M  --.-KB/s    in 0.1s    

2021-03-07 14:08:43 (37.1 MB/s) - ‘online_shopping_10_cats.zip’ saved [4084428/4084428]

Then we unzip the downloaded file online_shopping_10_cats.zip with, surprisingly, !unzip.

!unzip online_shopping_10_cats.zip
Archive:  online_shopping_10_cats.zip
  inflating: online_shopping_10_cats.csv  

The dataset has three columns: review for the review text, label for the sentiment, and cat for the product category. Here's a random sample of five reviews.

import pandas as pd
file_path = '/content/online_shopping_10_cats.csv'
df = pd.read_csv(file_path)
df.sample(5)
cat label review
35479 洗发水 0 买了两套说好的赠品吹风机没给!
35477 洗发水 0 抢购降价一半?坑,爹?没赶上时候?
53299 酒店 1 碰上酒店做活动,加了40元给升级到行政房。房间还不错,比较新。服务员是实习生,不熟练但态度认...
14367 手机 1 1)外观新颖2)拥有强大的多媒体功能和卓越的性能,同时将电池的消耗减到最小,方便了更多的用户...
12549 平板 0 分辨率太低,买的后悔了.

There are 62,774 reviews in total.

df.shape
(62774, 3)

The label column has only two unique values, 1 for positive reviews and 0 for negative ones.

df.label.unique()
array([1, 0])

The cat column has ten unique values.

df.cat.unique()
array(['书籍', '平板', '手机', '水果', '洗发水', '热水器', '蒙牛', '衣服', '计算机', '酒店'],
      dtype=object)

Before moving on, let's save the raw dataset to Google Drive. The dest variable can be any Google Drive path you like.

dest = "/content/drive/MyDrive/Python/NLP/shopping_comments/"
!cp {file_path} {dest} 
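Note that the !cp above assumes Google Drive is already mounted in the Colab session. Here's a minimal sketch of the mount step, with a hypothetical local fallback so the snippet also runs outside Colab:

```python
# The !cp command above requires Google Drive to be mounted first.
# Outside Colab the import fails, so fall back to a local path instead.
try:
    from google.colab import drive
    drive.mount('/content/drive')
    dest = "/content/drive/MyDrive/Python/NLP/shopping_comments/"
except ImportError:
    dest = "./"  # not on Colab: save next to the notebook instead
print(dest)
```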

Filtering the dataset

Now let's do some data filtering. The groupby function from pandas is very useful, and here's how to get the count of each unique value in the cat column.

df.groupby(by='cat').size()
cat
书籍      3851
平板     10000
手机      2323
水果     10000
洗发水    10000
热水器      575
蒙牛      2033
衣服     10000
计算机     3992
酒店     10000
dtype: int64

To create a balanced dataset, I decided to keep only the categories with exactly 10,000 reviews. That leaves five product categories: 平板 for tablets, 水果 for fruits, 洗发水 for shampoo, 衣服 for clothing, and finally 酒店 for hotels.

There are many ways to filter data in pandas. My favorite is to first create a filt variable holding a boolean Series, which in this particular case indicates whether the value in the cat column is in cat_list, the list of categories to keep. Then we can simply filter the data with df[filt]. After filtering, the dataset is reduced to 50,000 reviews.

cat_list = ['平板', '水果', '洗发水', '衣服', '酒店'] 
filt = df['cat'].isin(cat_list)
df = df[filt]
df.shape
(50000, 3)

Now the dataset is balanced in terms of both the cat and label columns. There are 10,000 reviews for each product category.

df.groupby(by='cat').size()
cat
平板     10000
水果     10000
洗发水    10000
衣服     10000
酒店     10000
dtype: int64

And there are 25,000 reviews for each of the two sentiments.

df.groupby(by='label').size()
label
0    25000
1    25000
dtype: int64
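The two checks above look at cat and label separately. A pd.crosstab of the two columns additionally reveals whether the sentiments are balanced within each category (illustrated here on a tiny stand-in frame; on the real data, replace toy with df and look for cells near 5,000):

```python
import pandas as pd

# Joint balance check: count reviews per (cat, label) cell.
# A tiny stand-in frame for illustration; on the real dataset this
# would be pd.crosstab(df['cat'], df['label']).
toy = pd.DataFrame({
    'cat':   ['平板', '平板', '水果', '水果'],
    'label': [0, 1, 0, 1],
})
counts = pd.crosstab(toy['cat'], toy['label'])
print(counts)
```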

Having made sure the filtered dataset is balanced, we can now reset the index, and save the dataset as online_shopping_5_cats_sim.csv.

df.reset_index(inplace=True, drop=True)
df.to_csv(dest+"online_shopping_5_cats_sim.csv", sep=",", index=False)

Converting the dataset to traditional Chinese

Let's load back the file we just saved to make sure the dataset is accessible for later use.

df = pd.read_csv(dest+"online_shopping_5_cats_sim.csv")
df.tail()
cat label review
49995 酒店 0 我们去盐城的时候那里的最低气温只有4度,晚上冷得要死,居然还不开空调,投诉到酒店客房部,得到...
49996 酒店 0 房间很小,整体设施老化,和四星的差距很大。毛巾太破旧了。早餐很简陋。房间隔音很差,隔两间房间...
49997 酒店 0 我感觉不行。。。性价比很差。不知道是银川都这样还是怎么的!
49998 酒店 0 房间时间长,进去有点异味!服务员是不是不够用啊!我在一楼找了半个小时以上才找到自己房间,想找...
49999 酒店 0 老人小孩一大家族聚会,选在吴宫泛太平洋,以为新加坡品牌一定很不错,没想到11点30分到前台,...

Next, I converted the reviews from simplified Chinese to traditional Chinese using the OpenCC library.

!pip install OpenCC
Collecting OpenCC
  Downloading https://files.pythonhosted.org/packages/d5/b4/24e677e135df130fc6989929dc3990a1ae19948daf28beb8f910b4f7b671/OpenCC-1.1.1.post1-py2.py3-none-manylinux1_x86_64.whl (1.3MB)
     |████████████████████████████████| 1.3MB 8.0MB/s 
Installing collected packages: OpenCC
Successfully installed OpenCC-1.1.1.post1

OpenCC supports several conversion configurations. I specifically used s2twp, which converts simplified Chinese to traditional Chinese with adaptation to Taiwanese vocabulary. The adaptation is not optimal, but it's better than a purely mechanical simplified-to-traditional conversion. Here's a sample review in the two writing systems.

from opencc import OpenCC
cc = OpenCC('s2twp') 
test = df.loc[49995, 'review']
print(test)
test_tra = cc.convert(test)
print(test_tra)
我们去盐城的时候那里的最低气温只有4度,晚上冷得要死,居然还不开空调,投诉到酒店客房部,得到的答复是现在还没有领导指示需要开暖气,如果冷到话可以多给一床被子,太可怜了。。。
我們去鹽城的時候那裡的最低氣溫只有4度,晚上冷得要死,居然還不開空調,投訴到酒店客房部,得到的答覆是現在還沒有領導指示需要開暖氣,如果冷到話可以多給一床被子,太可憐了。。。

Having made sure the conversion is correct, we can now go ahead and convert all reviews.

df.loc[:, 'review'] = df['review'].apply(cc.convert)

Let's make the same change to the cat column.

df.loc[:, 'cat'] = df['cat'].apply(cc.convert)

And then we save the converted dataset as online_shopping_5_cats_tra.csv.

df.to_csv(dest+'online_shopping_5_cats_tra.csv', sep=",", index=False)

Inspecting the dataset

Let's load back the file just saved to make sure it's accessible in the future.

df = pd.read_csv(dest+'online_shopping_5_cats_tra.csv')
df.tail()
cat label review
49995 酒店 0 我們去鹽城的時候那裡的最低氣溫只有4度,晚上冷得要死,居然還不開空調,投訴到酒店客房部,得到...
49996 酒店 0 房間很小,整體設施老化,和四星的差距很大。毛巾太破舊了。早餐很簡陋。房間隔音很差,隔兩間房間...
49997 酒店 0 我感覺不行。。。價效比很差。不知道是銀川都這樣還是怎麼的!
49998 酒店 0 房間時間長,進去有點異味!服務員是不是不夠用啊!我在一樓找了半個小時以上才找到自己房間,想找...
49999 酒店 0 老人小孩一大家族聚會,選在吳宮泛太平洋,以為新加坡品牌一定很不錯,沒想到11點30分到前臺,...

Before building models, I would normally inspect the dataset. There are many ways to do so. I recently learned a trick on Colab that lets you filter a dataset interactively, and all it takes is three lines of code.

%load_ext google.colab.data_table
from google.colab import data_table
data_table.DataTable(df, include_index=False, num_rows_per_page=10)
Warning: total number of rows (50000) exceeds max_rows (20000). Limiting to first max_rows.
cat label review
0 平板 1 很不錯。。。。。。很好的平板
1 平板 1 幫同學買的,同學說感覺挺好,質量也不錯
2 平板 1 東西不錯,一看就是正品包裝,還沒有開機,相信京東,都是老顧客,還是京東值得信賴,給五星好評
3 平板 1 總體而言,產品還是不錯的。
4 平板 1 好,不錯,真的很好不錯
... ... ... ...
49995 酒店 0 我們去鹽城的時候那裡的最低氣溫只有4度,晚上冷得要死,居然還不開空調,投訴到酒店客房部,得到...
49996 酒店 0 房間很小,整體設施老化,和四星的差距很大。毛巾太破舊了。早餐很簡陋。房間隔音很差,隔兩間房間...
49997 酒店 0 我感覺不行。。。價效比很差。不知道是銀川都這樣還是怎麼的!
49998 酒店 0 房間時間長,進去有點異味!服務員是不是不夠用啊!我在一樓找了半個小時以上才找到自己房間,想找...
49999 酒店 0 老人小孩一大家族聚會,選在吳宮泛太平洋,以為新加坡品牌一定很不錯,沒想到11點30分到前臺,...

50000 rows × 3 columns

Alternatively, if you'd like to see some sample reviews from all the categories, the groupby function is quite handy. The trick here is to feed pd.DataFrame.sample to the apply function so that you can specify the number of reviews to inspect from each product category.

df.groupby('cat').apply(pd.DataFrame.sample, n=3)[['label', 'review']]
label review
cat
平板 6247 0 這個平板真的是3G的嗎?你們有沒有忽悠唉,為什麼我下了一個百度影片,就卡的要死要活的,跟我以...
1081 1 看網頁玩王者榮耀都很流暢,音質畫面都不錯,就是稍微重了點,綜合性價比還是很好的
1042 1 我覺得還可以,就是把膜貼上了之後,有點滑不動,打遊戲的時候就很煩了
水果 10468 1 還不錯,這個價格比我在外面買的划算,以後還會經常來,個頭不是很大,還可以吧
10986 1 蘋果味道好,就是小了一點,比想象的要小量了一下,好像基本上都沒到70毫米。快遞還是挺快的包裝...
18062 0 這是我在京東消費這麼多年來買到唯一次最爛的東西,還是自營一斤6塊錢的就這貨色還有一個是壞的,...
洗髮水 23271 1 很好,很舒服,清揚就是好用,謝謝老闆,希望一直好用,好好好好好好好好好,快樂
21992 1 一如既往的好 京東速度快 值得信賴 優惠多多
21867 1 京東購物 多快好省 寫評論真的很累 有木有 每次商品很滿意就用這個 各位大佬請放心購買
衣服 30157 1 褲子質量不錯,貨真價實,和店家介紹的基本相符,大小合適,樣式也很滿意,穿上褲子走路感到很輕便,舒服
33190 1 做工精細穿起來很舒服,質量很好
35326 0 質量很差,沒有想象的那麼好
酒店 42243 1 房間是新裝修的,用的是淺色調,感覺很溫馨,佈局較合理,顯的比較寬敞.酒店選用的布草很講究,放...
48082 0 實在忍無可忍。1、水。缺水現象——洗澡至中途,突然斷水,望著滿身的肥皂泡欲哭無淚;多水現象—...
49552 0 下雨天冷,想洗熱水澡,可惜開了半小時水還是冷的

Finally, one of the most powerful ways of exploring a dataset is to use the facets-overview library. Let's first create a column for the length of review texts.

df['len'] = df['review'].apply(len)
df.tail()
cat label review len
49995 酒店 0 我們去鹽城的時候那裡的最低氣溫只有4度,晚上冷得要死,居然還不開空調,投訴到酒店客房部,得到... 86
49996 酒店 0 房間很小,整體設施老化,和四星的差距很大。毛巾太破舊了。早餐很簡陋。房間隔音很差,隔兩間房間... 102
49997 酒店 0 我感覺不行。。。價效比很差。不知道是銀川都這樣還是怎麼的! 29
49998 酒店 0 房間時間長,進去有點異味!服務員是不是不夠用啊!我在一樓找了半個小時以上才找到自己房間,想找... 64
49999 酒店 0 老人小孩一大家族聚會,選在吳宮泛太平洋,以為新加坡品牌一定很不錯,沒想到11點30分到前臺,... 455
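As a numeric complement to the visualization below, groupby also summarizes review lengths per category (sketched on a tiny stand-in frame; on the real data, run df.groupby('cat')['len'].describe()):

```python
import pandas as pd

# Per-category length statistics; on the real dataset this would be
# df.groupby('cat')['len'].describe() or the agg() call below.
toy = pd.DataFrame({
    'cat': ['平板', '平板', '酒店', '酒店'],
    'len': [12, 30, 86, 455],
})
stats = toy.groupby('cat')['len'].agg(['mean', 'min', 'max'])
print(stats)
```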

Then we install the library.

!pip install facets-overview
Collecting facets-overview
  Downloading https://files.pythonhosted.org/packages/df/8a/0042de5450dbd9e7e0773de93fe84c999b5b078b1f60b4c19ac76b5dd889/facets_overview-1.0.0-py2.py3-none-any.whl
Requirement already satisfied: protobuf>=3.7.0 in /usr/local/lib/python3.7/dist-packages (from facets-overview) (3.12.4)
Requirement already satisfied: pandas>=0.22.0 in /usr/local/lib/python3.7/dist-packages (from facets-overview) (1.1.5)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.7/dist-packages (from facets-overview) (1.19.5)
Requirement already satisfied: six>=1.9 in /usr/local/lib/python3.7/dist-packages (from protobuf>=3.7.0->facets-overview) (1.15.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from protobuf>=3.7.0->facets-overview) (53.0.0)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.22.0->facets-overview) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.22.0->facets-overview) (2018.9)
Installing collected packages: facets-overview
Successfully installed facets-overview-1.0.0

To render an interactive visualization of the dataset, we first convert the DataFrame df to JSON and then insert it into an HTML template, as shown below. If you choose len for Binning | X-Axis, cat for Binning | Y-Axis, and finally review for Label By, you'll see all the reviews beautifully arranged by text length along the X axis and product category along the Y axis. They're also color-coded by sentiment: blue for positive, red for negative. Clicking on a point of either color shows the values of that particular data point. Feel free to play around.

from IPython.core.display import display, HTML
jsonstr = df.to_json(orient='records')
HTML_TEMPLATE = """
        <script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=jsonstr)
display(HTML(html))